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Abstract 

The incorrect alignments are a severe problem in variant calling, and remain as a challenge 
computational issue in Bioinformatics field. Although there have been some methods utilizing the 
re-alignment approach to tackle the misalignments, a standalone re-alignment tool for long 
sequencing reads is lacking. Hence, we present a standalone tool to correct the misalignments, 
called ProbAlign. It can be integrated into the pipelines of not only variant calling but also other 
genomic applications. We demonstrate the use of re-alignment in two diverse and important 
genomics fields: variant calling and viral quasispecies reconstruction. First, variant calling results 
in the Pacific Biosciences SMRT re-sequencing data of NA12878 show that false positives can be 
reduced by 43.5%, and true positives can be increased by 24.8% averagely, after re-alignment. 
Second, results in reconstructing a 5-virus-mix show that the viral population can be completely 
unraveled, and also the estimation of quasispecies frequencies has been improved, after re- 
alignment. ProbAlign is freely available in the PyroTools toolkit 
( https://github.com/homopolymer/PyroTools ). 
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Introduction 

The advent of high throughput sequencing (HTS) technologies has revolutionized the discovery of 
genetic changes in genome sequence. Various sizes of samples, such as a set of targeted regions in 
a genome, or the entire genome of a sample, even the genomes of a pool of samples, can be 
completely sequenced in days or weeks, producing huge data of gigabase-size. In a genome, 
genetic variants, e.g. single nucleotide polymorphisms (SNPs), multiple nucleotide polymorphisms 
(MNPs), short insertions and deletions (InDels), can be pinpointed at the base pair (bp) resolution 
by using the re-sequencing or targeted sequencing technique. Leveraging on the multiplex 
indexing technique, a population of tens of samples can be sequenced by using the same 
sequencing runs and completed in feasible time with affordable cost. Population sequencing data 
gears up the genotyping of genetic variants in a population, and expedites the estimation of the 
distribution of genetic variants in the population, which eventually facilitating the dissection of 
population structure [1], the sketching of fitness landscape [2], the reconstruction of the clonal 
evolution [3], etc. 

There are four kinds of sequencers that dominate the market, including Illumina [4], Roche/454 
[5], Ion Torrent [6], and Pacific Biosciences SMRT [7]. While all of these sequencers can produce a 
huge amount of sequencing reads, they are different at various aspects, e.g. reagent cost per base, 
data throughput, read length, error patterns, etc. The Illumina sequencer is commonly used in the 
large-scale genomics projects that require the sequencing of a large population or high sequencing 
depth, because it achieves the lowest per base sequencing cost and the largest data throughput. 
The length of Illumina sequencing reads is within the range from lOObp to 300bp. Although this 
interval covers most of microsatellite repeats and short interspersed nuclear elements (SINEs), 
short reads limit the ability of resolving the complex regions in genome, e.g. long interspersed 
nuclear elements (LINEs), of which the size is from 500bp to 8000bp [8]. Roche/454 and Ion 
Torrent sequencers can produce the mediate length of sequencing reads ranging from 300bp to 
800bp. Pacific Biosciences SMRT sequencer, which using third-generation single molecule 
sequencing technology, can produce long reads up to lOkilo base pairs. Mediate and long reads 
are less sensitive to the repetitive regions, and then achieve the better mappability than short 
reads, so as to move forward the reconstruction of complex regions in genome [9]. Besides the 
discrepancy at read length, the error patterns of these sequencers are different. The Illumina 
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sequencer is prone to substitute the underlying bases with erroneous bases [10]. In Roche/454 
and Ion Torrent data, the insertions and deletions are prevalent because they are imprecise at the 
determination of the length of homopolymer runs [11]. The insertions and deletions are also the 
dominant error patterns in the Pacific Biosciences SMRT data [12]. All of these four kinds of 
sequencers play the important roles in the genomic research fields. 

Variant calling is the first step of HTS applications, and proceeds in the following stages. First, 
sequencing reads are mapped onto the reference genome, and piled up on the genome. Second, 
the differences are detected by comparing sequencing reads with the genome. Among the called 
variants, the false discoveries are resulted from three confounding factors: sequencing errors, 
alignment biases, and mapping artifacts, (i) Base quality score and error modeling are employed 
in both Likelihood- and Bayesian-based variant calling methods in order to recognize the truth 
genetic signals and eliminate sequencing errors [13-15]. (ii) The gaps in the short tandem 
repeats, e.g. homopolymer runs and dinucleotide repeats, are usually placed at wrong positions, 
then resulting in the incorrect alignments and also the false positive variants. It is a challenge to 
compute the correct alignments in these low complexity regions, because (a) sequencing reads 
vary in a large range at determining the number of repetitive elements, and (b) there is no any 
proper scoring function, which reflecting error patterns of a sequencer. Alignment biases are 
pervasive in the mediate and long sequencing reads, (iii) The repetitive elements in genome, e.g. 
satellites, rDNAs, SINEs, LINEs, Alu repeats, segmental duplications, etc., place a problem at 
searching for the original positions of sequencing reads, especially for short reads. A short read is 
often randomly assigned to a position by a mapping algorithm, if it could be mapped to multiple 
regions of the similar repetitive structure. The differences between the repetitive regions could 
be reported as genetic variants, eventually increasing the false discovery rate. Moreover, the 
alignment biases could place the truth variants at random positions, which decreasing the number 
of the variant supporting reads in the pileup, and then result in the false negatives. The mapping 
artifacts that place reads in wrong positions are also a resource of the false negatives. 

Haplotype-based variant calling has been reported as an efficient variant calling method [16-19]. 
The computation steps include: (1) generating the candidate haplotypes, (2) computing the new 
alignments between the sequencing reads and candidate haplotypes, which is also called re- 
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alignment, and (3) inferring the underlying genotype. The improvement of haplotype-based 
methods in variant calling relies on the following advantages, (i) First, haplotype-based methods 
employ the graph technique, e.g. de Bruijn graph [20] or partial order graph [21], to reconstruct 
the consensus sequences (haplotypes) within a region of interest (ROI). The strength of the graph 
technique is to remove the spurious signals by using the redundancy in the pileup of sequencing 
reads. The underlying genotype is a combination of the consensus sequences, which can explain 
the largest portion of the aligned reads in the ROI region. As a consequence of the mapping 
artifacts, the number of unique reads in a ROI region could be more than the expected number of 
haplotypes, e.g. two haplotypes for a diploid sample. The genotype inference places the 
constraints on the number of haplotypes, and then eliminates the mapping artifacts, (ii) Second, 
haplotype-based methods approximate the truth alignments of the sequencing reads by 
enumerating over all possible haplotypes. The misalignments can be corrected if sequencing 
reads were aligned to their original haplotypes. 

However, the re-alignment is commonly implemented as an internal workhorse inside the 
haplotype-based variant calling methods, and it is not able to access to the re-alignment results 
from the exterior. GATK provides a tool, which called IndelRealigner, to adjust the alignments 
around the insertions/deletions, but it is restricted to process the Illumina sequencing data [14]. 
SRMA is another re-alignment tool, but its speed is a critical issue and cannot work on Pacific 
Biosciences SMRT data [22]. To our knowledge, there is no any standalone re-alignment tool that 
working for long read sequencers under the feasible running time in the Bioinformatics field. The 
high quality alignments can also benefit other genomic applications, e.g. the detection of short 
tandem-repeat variation [23] and viral quasispecies reconstruction [24]. There are two 
difficulties in the implementation of the re-alignment. First, the re-alignment is a time-consumed 
computation, because it requires the computation of the pairwise alignments between sequencing 
reads and all possible haplotypes. Second, there is no any proper scoring function that can 
correctly score the alignments for sequencing reads, of which sequencing errors vary in a diverse 
range. 

In this paper, we presented a standalone re-alignment tool, called ProbAlign, for long read 
sequencing technologies. ProbAlign addressed the above two difficulties. First, ProbAlign 
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employs the graph technique to represent the consensus sequences in a ROI region, which 
resembles the partial order graph or variant graph. ProbAlign computes the new alignment of a 
sequencing read by aligning the read to the graph. The time complexity is 0(n|K|), n is read 
length, \V\ is the number of vertices in the graph. As a comparison, the time complexity of the re- 
alignment in the haplotype-based methods is 0(Amri), X is the number of haplotypes, m is 
haplotype length, and n is read length. Hence, the time complexity is reduced by an order of 
magnitude. Second, ProbAlign employs the conditional random field (CRF) technique to model 
error patterns in the sequencing reads. The parameters of the CRF model can be estimated from 
the BAM file at hand. This enables the proper scoring of the correct alignments for sequencing 
reads, which are contaminated by various sequencing errors. 

Results 

Overview of ProbAlign 

The rationale of the re-alignment method is stated in the following. Assume that a bucket of reads 
have been aligned into a multiple sequence alignment (MSA), of which a row is an aligned read 
and a column is an aligned position. The computational task is to align a new read onto the MSA. 
The strategy, which implemented in ProbAlign, is to pick up a sequence in the MSA, and then align 
the new read to the picked sequence. The sequence is picked such that sequencing errors in the 
sequence are most similar to those in the new read. 

As shown in Fig. 1, ProbAlign defines low complexity regions, e.g. homopolymer runs and 
dinucleotide repeats, and putative multiple nucleotide polymorphism (MNP) loci as the regions of 
interest (ROIs). (i) First, a DAG is established from the raw mapping results in a ROI. In the DAG, 
A vertex represents one allele in an aligned position. An edge links two vertices that are occurred 
in an aligned read, (ii) Second, the vertices in the DAG are sorted in the topological order, and a 
depth-first search (DFS) algorithm is deployed to find the maximally weighted path(s) in the graph, 
which are inferred as the consensus sequence(s) in the ROI. (iii) Next, a new DAG is constructed 
from the reference sequence and the consensus sequence(s). This DAG is named as the consensus 
graph, because every path is a consensus sequence, (iv) Then, ProbAlign iterates through the 
sequencing reads in the ROI, and the alignments of sequencing reads against the consensus graph 
are computed by using a Viterbi algorithm. Once the new alignment of a sequencing read is better 
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than the original alignment, e.g. the number of mismatches declines, and then the new alignment 
replaces the old alignment in the given BAM file. Also, the consensus graph is updated by adding 
new vertices and/or edges, which representing the variants in the new alignment. The reason of 
the graph update is that the number of short repetitive elements varies in a diverse range in long 
sequencing reads, the alignment quality can be improved if the graph has the local structures 
describing the large deletions/insertions in low complexity regions, (v) Finally, the CIGAR strings 
and MD tags of the re-alignments are stored in the new BAM file. Two examples of the re- 
alignment are shown in Fig. 2. More re-alignment examples could be found in the supplemental 
materials. 

Application I: SNP calling of NA12878 

The alignment files of Pacific Biosciences SMRT re-sequencing data (unpublished) of NA12878 
that sequenced by Mount Sinai School of Medicine are publicly available in the 1000 Genomes 
Project FTP site. The alignment files of Roche/454 re-sequencing data of NA12878 was also 
downloaded from the 1000 Genomes Project FTP site. The Ion Torrent re-sequencing reads of 
NA12878 were downloaded from the National Center for Biotechnology Informatics (NCBI) Short 
Read Archive (SRA) (Project Accession Number: PRJNA162355). The Ion Torrent sequencing 
reads were mapped to the human genome (version: hgl9) by using the Burrows-Wheeler Aligner 
(BWA) MEM algorithm [25] with the default option settings. 

The alignments in the BAM files of Roche/454, Ion Torrent, and Pacific Biosciences SMRT reads 
were adjusted by using ProbAlign in the default settings. The SNP calling software Samtools [13] 
was deployed to call SNPs in both the original and adjusted BAM files. The VCF files of SNP calling 
results were compared with the National Institute of Standards and Technology (NIST) GIAB high- 
confidence benchmark calls (v2.18), which constructed by the Genome in a Bottle (GIAB) 
consortium by integrating the variant calls from five sequencers, including Illumina, Roche/454, 
Ion Torrent, SOLiD and Complete Genomics [26]. The evaluation measures, e.g. the number of true 
positives, the number of false positives, and the false discovery rate (FDR), were calculated by 
using the program VCFComparator in USeq toolkit. 
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Taking Pacific Biosciences SMRT data for example, the number of sequencing reads that mapped 
to chromosome 21 is 661553, of which the average read length is 1805bp. ProbAlign found 70581 
ROIs in chromosome 21, and re-aligned 244986 (37.0%) sequencing reads. Both the original BAM 
file and re-aligned BAM file were input to Samtools to call SNPs by using the same options. When 
the cutting threshold of variant posterior probability was set as 0.5, which implying that variant 
quality score should be greater than 3.02, Samtools called 41348 true positives and 1815 false 
positives in the re-aligned BAM file. As a comparison, The Samtools caller detected 41322 true 
positives and 3214 false positives in the original BAM file. As a result of re-alignment, the true 
positives were increased by an amount of 26 (0.1%), and the number of true negatives was 
decreased by 1399 (43.5%). We plotted the increment of true positives in Fig. 3a, by varying the 
cutting threshold of variant quality score. It is noted that the re-alignment improved the 
probability of truth SNPs being detected, especially increasing the number of high quality truth 
SNPs. As shown in Fig. 3a, the average relative increment percentage reaches 24.8%. We also 
plotted the decrement of false positives along with the variant quality score in Fig. 2b. It revealed 
that the re-alignment efficiently suppress the false positive detection. The re-alignment method 
helps decrease a considerable number of false positives in the low quality SNPs, e.g. over 40% of 
false positives have been eliminated at the variant quality score threshold of 10. Although the 
figure of the relative decrement percentage shows that the re-alignment method harasses the 
detection of false positives in the high quality calls, the absolute number is below 5 and could be 
overlooked. Generally speaking, the re-alignment decreases the false discovery rate (Fig. 3c), as 
well as increases the capability at detecting high quality variants. In both Roche/454 and Ion 
Torrent data, the false discovery rates were also reduced. Those results can be found in the 
supplemental materials. 

Application II: HIV-1 quasispecies reconstruction 

We also evaluated the performance of ProbAlign in viral population analysis. We used a public 
HIV-1 benchmark data including five HIV-1 strains, which are HIV-ls9.6, HIV-1hxb2, HIV-Ijr-csf, HIV- 
1nl43, and HIV-1yu2 [27]. The proportion of these five strains in the mix was measured by the data 
provider who using the single-genome amplification (SGA) method. The SGA measures are used as 
the golden standard to evaluate the goodness of HIV-1 quasispecies reconstruction. We used the 
BWA MEM algorithm to map the Pacific Biosciences SMRT sequencing data onto the genome 
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sequence of HIVhxb2. The Pacific Biosciences SMRT sequencing data harbors unexpected high rate 
of the indels, such that it is difficult to reconstruct the quasispeices in the mix. Hence, we deployed 
ProbAlign to adjust the alignments in the region of the gene gp41 that starting from 7758 to 8795 
in the HIVhxb2 genome. We utilized the haplotype inference program PredictHaplo [28] to 
reconstruct the quasispecies in this region. As shown in Table 1, PredictHaplo reconstructed 4 
quasispecies, if the alignments were not adjusted. The strain HIVhxb2 is missed in the 
reconstructed population. After using ProbAlign to adjust the alignments, PredictHaplo 
successfully reconstructed all sequences of those five HIV-1 strains. Also, the re-alignment 
improved the estimation of viral quasispecies frequencies. After re-alignment, the root-mean- 
square error (RMSE) of frequency estimation is reduced from 0.101 to 0.058. Therefore, besides 
the SNP calling, the re-alignment also helps resolving the structure of viral population. 

Discussions 

Where to re-alignment? 

What difficulties remain in variant calling? 

How to set the scoring function? 

Computational time 

Conclusions 

Methods 

Conditional random field 

Conditional random field (CRF) is a machine learning method for the processing and manipulation 
of sequential data, e.g. next-generation sequencing reads. CRF, like hidden Markov model (HMM), 
can compute the alignment of two sequences by using the Viterbi algorithm [29]. And also, CRF 
provides more flexible utilities at modeling sequence context in the alignment. Three kinds of 
distinct feature functions are employed in the proposed CRF model. The feature functions are 
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defined as f(x, y, ay), of which a if j is the alignment of two subsequences X[ 1:£ ] and y^-jy If a 
feature appears in the alignment, then f(x, y, a i; ) = 1. Otherwise, f(x, y, a i; ) = 0. (1) First, one 
kind of these feature functions is to describe the hidden state transition between two contiguous 
positions, e.g. previous state is match and current state is insert. (2) The second type of feature 
functions is the nucleotide emission in the alignment, e.g. x t — A emits y ; - = A at the hidden state 
match. (3) The rest of feature functions are designed to formulate the homopolymer 
insertions/deletions in the alignment, e.g. the largest length is 5bp and the difference is lbp in the 
alignment of X[i_ 4: j] = AAAA to Yn-s-j] = AAAAA at the hidden state insert. 

There are 76 feature functions in the CRF model in total, T = {f k | 1 < k < 76}. A weight w k is 
assigned to a feature function f k to reflect the importance of the feature. Then, the scoring of an 
alignment a is 

S (a) = ) w k f k (x, y, a w ) - log z (x, y) 

The partition function z(x, y) = £ a exp(X £;k w k / k (x,y, a £>7 -)) is computed by using the forward 
algorithm. The value of feature function weights w k are estimated by using the iterative learning 
method and quasi-Newton optimum searching technique [30]. 

Graph based re-alignment 

The graph based multiple sequence alignment (MSA) was early proposed to resolve the gap 
representation problem of the consensus/profile based MSA that usually charges the gap 
incorrectly [21]. Recently, the graph based MSA technique has been re-invented to reconstruct the 
consensus sequence(s) from high throughput sequencing data, because the graph provides the 
structure to represent the linkage information between nearby variants, and is capable to 
eliminate random errors in the noise-contaminated sequencing data. This idea was employed in 
recent developed Bioinformatics tools, which are used to detect variants [18] or assemble long 
reads of high error rate [31]. The graph technique was also employed to adjust the alignments 
around the insertions/deletions, by leveraging on the consensus structure of the pileup of aligned 
reads [22]. However, the scoring function is a critical issue that the trivial scoring function cannot 
resolve sequencing errors in complex regions, e.g. the tandem repetitive regions. Therefore, a 
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contribution of this paper is to propose a machine learning method to estimate the scoring 
function from high-throughput sequencing data. Another contribution is to employ the graph- 
based MSA to resolve the misalignments around tandem repeats, e.g. mono- and di-nucleotide 
repeats. 

The re-alignment method is stated in the following steps. (1) The re-alignment method starts 
from the raw alignments that are generated by a mapping program. A graph is constructed from 
the raw alignments, of which the vertices are nucleotides in the alignments. Two vertices are 
linked by a directed edge if these two adjoining nucleotides occur in an alignment. The spurious 
edges (e.g. the edge occurs once in data) are pruned away. If the raw alignments were consistent, 
the re-alignment method would halt. Otherwise, the re-alignment method will proceed. The raw 
alignments are inconsistent if there are at least two paths in the graph, of which the path labels are 
the same. (2) Once the raw alignments are inconsistent, the top ranked k paths are chosen as the 
possible underlying consensus sequences, and the rest of paths are removed from the graph, 
excluding the reference. Then, all vertices in the graph are sorted by topological ordering. (3) A 
read is aligned against the graph by using the Viterbi algorithm. After the computation, a path in 
the graph is chosen as the aligned template, and the new CIGAR flag of the alignment between the 
read and template is reported. And then, the graph is updated by the computed alignment. This 
step is repeated until that all reads are aligned to the graph. 

Datasets and other programs 

Pacific Biosciences SMRT sequencing data of NA12878 

ftp://ftp.10QQgenomes.ebi.ac.uk/voll/ftp/technical/working/2Q1312Q9 na!2878 pacbio/Schadt 
/ali gnment/ 

Roche/454 sequencing data of NA12878 

http://ftp.lQQ0genomes.ebi.ac.uk/voll/ftp/technical/pilot2 high cov GRCh37 bams/data/NA12 
878/alignment/ 



GIAB high-confident benchmark calls of NA12878 
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/ 



Downloaded from http://biorxiv.org/on September 18, 2014 



5-virus-mix 

https://github.com/armintoepfer/5-virus-mix 
USeq toolkit 

http://sourceforge.net/projects/useq/ 
PredictHaplo 

http://bmda.cs.unibas.ch/HivHaploTyper/ 
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Table 1. Reconstruction of HIV gene gp41 in 5-virus-mix 
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Figure 1. Workflow of ProbAlign. ProbAlign adjusts the misalignments around homopolymer 
runs, dinucleotide repeats, and multiple nucleotide polymorphisms, (a) A region of interest 
includes a homopolymer run of length 13bp, and also harbors a number of the incorrect 
alignments, (b) An alignment graph is constructed from the pileup. Red vertices in the backbone 
are the bases in reference genome. Blue vertices are the mismatches. Green vertices are the 
inserted bases. Dashed edges represent insertions and deletions. Weight of an edge e = (u, v) is 
computed by the formula, w(e) = occurrence^) ^ consensus g ra ph is constructed 

L„l rT„i?j „„r,^OCCUrrence(e ) 

e ElnEdgeiu) v 3 

from the alignment graph, (d) It is an alignment of a read sequence (colored blue) against the 
consensus graph (colored red). 
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Figure 2. Re-alignment examples of long sequencing reads. The tracks from top to bottom are the 
original BAM file, the re-aligned BAM file, and human genome, (a) An example of re-alignment 
increasing true positive detection. The truth SNP at chr21:10710102 is rescured by replacing the 
misalignments, (b) An example of re-alignment reducing false positive detection. The false SNP at 
chr21:9834810 is reduced by replacing the misalignments. 

Figure 3. The increment of true positives, the decrement of false positives, and FDR curve, (a) The 
increment of true positives, red bars are the absolute changes, and green bars are the relative 
changes, (b) The decrement of false positives, red bars are the absolute changes, and green bars 
are the relative changes, (c) FDR curve. 
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